Visualization & Predictive Analysis

Loading Data & Packages

options(scipen = 999)

library(tidyverse)
library(cluster)
library(factoextra)
library(caret)
library(rpart)
library(rpart.plot)
library(tidymodels)
library(DT)
library(rattle)
library(scales)
library(lemon)
library(plotly)
library(lubridate)
library(ggplot2)

myData <- read.csv('database.csv')

view(myData)

Part One - Data Exploration

Visual 1

annual_fuel_cost_by_type <- myData %>%
  filter(Fuel.Type.1 != 'Natural Gas') %>%
  ggplot() +
  geom_boxplot(show.legend = F) + 
  aes(x = Fuel.Type.1, y = Annual.Fuel.Cost..FT1., fill = Fuel.Type.1) +
  scale_y_continuous(labels = dollar_format()) +
  labs(x = "Fuel Type", y = "Annual Fuel Cost",title = "Annual Fuel Cost by Type") +
  coord_flip() 
  #shows the $ amount for each fuel type group

ggplotly(annual_fuel_cost_by_type)

This box plot visualizes the differences in yearly fuel costs for the different gas types. It shows how much money Electric cars save, over a $1,000, versus gasoline cars yearly. Also from the data, annual fuel costs for Premium Gasoline is the highest and has the most variance.

Visual 2

myData %>%
  group_by(Fuel.Type.1) %>%
  filter(Fuel.Type.1 != 'Electricity')%>% 
  summarise(averageMPG = mean(Combined.MPG..FT1.)) %>% #calculating avg MPG per Fuel Type
  ggplot() + 
  aes(x = averageMPG, y= Fuel.Type.1 , fill = Fuel.Type.1) +
  geom_col(show.legend = F) +
  labs(x = "Average MPG", y = "Fuel Type", title = "Average MPG by Fuel Type")

Now I am comparing the average miles per gallon for cars of each fuel type. According to the visual, Diesel gasoline has the best fuel efficiency.

Visual 3

myData %>%
  group_by(Engine.Cylinders) %>%
  summarise(averageMPG = mean(Combined.MPG..FT1.)) %>%
  ggplot(aes(x = Engine.Cylinders, y = averageMPG, fill = Engine.Cylinders))+
  geom_col() +
  labs(title = "Average MPG by Engine Cylinder", y = "Average MPG", x = "Engine Cylinders", fill = "Cylinders")

  #shows MPG average grouped by cylinders

The bar graph visualizes the average combined MPG grouped by the vehicles’ engine size. This identifies that the less cylinders, the smaller the engine; therefore, the vehicle will have a greater MPG. According to the data, 4 cylinder cars average almost 10 more miles per gallon than v8s.

Visual 4

boostedVehicles <- myData %>% 
  mutate(noTurbo = is.na(Turbocharger)) %>%
  mutate(noSuperCharger = ifelse(Supercharger == 'S', FALSE, TRUE)) %>%
  mutate(hasBoost = (ifelse(noTurbo== FALSE | noSuperCharger == FALSE , 'BOOST!', 'No Boost'))) %>%
  filter(hasBoost == 'BOOST!')
#getting only vehicles with boost

boostedVehicles %>%
  group_by(Year) %>%
  summarise(avergaeCombinedMPG = mean((City.MPG..FT1. + Highway.MPG..FT1.)/2)) %>%
  ggplot(aes(x = Year, y = avergaeCombinedMPG))+
  geom_line()+
  geom_smooth(method = 'lm', se = FALSE) + 
  labs(title = 'Average MPG per Year for Boosted Cars', y = 'Combined MPG', x = 'Year')

  #seeing MPG vs Time for boosted cars

The line graph shows the difference over time of the average MPG for boosted cars. ‘Boosted cars’ refers to cars that have either a Turbocharger or Supercharger. These are devices that increase an engine’s internal combustion (they push air into the engine) and therefore increase power. The line graph shows that as time goes on, the average MPG increases. We can infer that the increase in average MPG is a result of advancements in technology that allow for better fuel efficiency than in previous years. From 2007 to today, the average MPG has increased by almost 5 miles.

Visual 5

myData.features <- myData%>%
  select(City.MPG..FT1., Highway.MPG..FT1., Unadjusted.Highway.MPG..FT1. , Unadjusted.City.MPG..FT1. )


myClusters <- kmeans(myData.features,3)


# Viewing Clusters
table(myData$Class, myClusters$cluster)
##                                     
##                                         1    2    3
##   Compact Cars                          8 3851 1649
##   Large Cars                           32  533 1326
##   Midsize Cars                         13 2198 2184
##   Midsize Station Wagons                1  217  305
##   Midsize-Large Station Wagons          0  215  441
##   Minicompact Cars                      7  554  699
##   Minivan - 2WD                         0   39  303
##   Minivan - 4WD                         0    0   47
##   Small Pickup Trucks                   0  152  386
##   Small Pickup Trucks 2WD               0  117  319
##   Small Pickup Trucks 4WD               0   10  208
##   Small Sport Utility Vehicle 2WD       5  337   61
##   Small Sport Utility Vehicle 4WD       0  346  180
##   Small Station Wagons                  6 1149  344
##   Special Purpose Vehicle               0    1    0
##   Special Purpose Vehicle 2WD           2   98  513
##   Special Purpose Vehicle 4WD           0    9  293
##   Special Purpose Vehicles              0  139 1316
##   Special Purpose Vehicles/2wd          0    0    2
##   Special Purpose Vehicles/4wd          0    0    2
##   Sport Utility Vehicle - 2WD           6  393 1228
##   Sport Utility Vehicle - 4WD           0  217 1865
##   Standard Pickup Trucks                0   27 2327
##   Standard Pickup Trucks 2WD            0   39 1138
##   Standard Pickup Trucks 4WD            0    1  985
##   Standard Pickup Trucks/2wd            0    0    4
##   Standard Sport Utility Vehicle 2WD    0   20  162
##   Standard Sport Utility Vehicle 4WD   10   58  366
##   Subcompact Cars                      16 2918 1938
##   Two Seaters                          13  678 1195
##   Vans                                  0   13 1128
##   Vans Passenger                        0    0    2
##   Vans, Cargo Type                      0    1  437
##   Vans, Passenger Type                  0    0  311
#Plot our results to see what we get
plot(myData[c("City.MPG..FT1.","Highway.MPG..FT1.")], col = myClusters$cluster)

After performing the K-means algorithm on the City and Highway MPG, I found that there are 3 optimal clusters to this data. I ran the algorithm with 2 and 4 clusters and found too many data points were being held in 2 clusters, and too little data points were found with 4 clusters. The most data points were Compact Cars, Midsize Cars, and Subcompact Cars because these are the most common type of vehicles.

Part Two - Predicitve Analysis

Partioning the Data

carsData <- myData %>%
  select(Vehicle.ID, Year, Make, Model,Class, Drive, Transmission, Engine.Cylinders, Engine.Displacement,Turbocharger,Supercharger, Fuel.Type.1, City.MPG..FT1.,Highway.MPG..FT1., Annual.Fuel.Cost..FT1.)%>%
  drop_na()
#getting needed cols/ removing nulls

set.seed(1)
cars_split <- initial_split(carsData, prop = .75)

cars_training <- training(cars_split)

cars_testing <- testing(cars_split)

Predicting Highway MPG

training_model_Highway <- lm(formula = Highway.MPG..FT1. ~  Year+ Engine.Cylinders+ Engine.Displacement+ City.MPG..FT1.+ Highway.MPG..FT1.+ Annual.Fuel.Cost..FT1., data = cars_training) #predicting mpg using the selected columns (integers only)

summary(training_model_Highway)

ggplot(data = training_model_Highway) +
  aes(x = training_model_Highway$residuals)+
  geom_histogram()+
  labs(x = 'Highway MPG Training Residuals', y = 'Count', title = 'Distribution of Highway MPG LM Residuals')

#the distribution is minimal. high freq. at zero for residuals. errors are close to 0 and symmetric

The Highway MPG linear regression model has a nice symmetric distribution of residuals. The visual shows that the residuals have a high frequency around zero. Having errors close to zero, with the median residual of 0.0645, proves the original model is a good fit.

Predicting City MPG

training_model_City <- lm(City.MPG..FT1. ~ Year+ Engine.Cylinders+ Engine.Displacement+ City.MPG..FT1.+ Highway.MPG..FT1.+ Annual.Fuel.Cost..FT1., data = cars_training) #building the model with select cols

summary(training_model_City)
## 
## Call:
## lm(formula = City.MPG..FT1. ~ Year + Engine.Cylinders + Engine.Displacement + 
##     City.MPG..FT1. + Highway.MPG..FT1. + Annual.Fuel.Cost..FT1., 
##     data = cars_training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9361 -0.8851 -0.1360  0.7841  8.2380 
## 
## Coefficients:
##                          Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)            46.1668869  4.9295538   9.365 < 0.0000000000000002 ***
## Year                   -0.0191884  0.0025032  -7.666   0.0000000000000223 ***
## Engine.Cylinders       -0.1080673  0.0293823  -3.678             0.000238 ***
## Engine.Displacement    -0.0295729  0.0508686  -0.581             0.561033    
## Highway.MPG..FT1.       0.5944553  0.0084889  70.027 < 0.0000000000000002 ***
## Annual.Fuel.Cost..FT1. -0.0018750  0.0001094 -17.142 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.35 on 3921 degrees of freedom
## Multiple R-squared:  0.9048, Adjusted R-squared:  0.9047 
## F-statistic:  7452 on 5 and 3921 DF,  p-value: < 0.00000000000000022
ggplot(data = training_model_City) +
  aes(x = training_model_City$residuals)+
  geom_histogram(bins = 20)+
  labs(x = 'City MPG Training Residuals', y = 'Count', title = 'Distribution of City MPG LM Residuals')

The City MPG linear regression model has errors close to zero. This shows the model predicts accurately (no under/over bias). With a median residual of -0.136, the residuals are symmetrically distributed around zero. However, the model signifies that a car’s engine displacement variable is not a valuable addition to the model with a p-value of 0.561. Also this model does have some outlying residuals.

Class of a Vehicle Decision Tree Model

newData <- myData %>%
  select(-Vehicle.ID) %>%
  filter(Make == "Ford") %>%# filtering Ford cars from 1990+
  filter(Year >= 1990) %>%
  mutate(Class = ifelse(Class == c("Subcompact Cars", "Compact Cars", "Midsize Cars"), c("Subcompact Cars", "Compact Cars", "Midsize Cars"), "Other")) %>%
  mutate(Class = as.factor(Class))


set.seed(1)

split <- initial_split(newData, prop = 0.7)
training_data <- training(split)
validation_data <- testing(split)
#splitting data

class_tree <- rpart(Class ~ Engine.Cylinders + City.MPG..FT1., data = training_data, parms = list(split = "gini"), method = "class", control = rpart.control(cp = 0, minsplit = 1, minbucket = 1))
#building the tree model wt cylinders & city mpg

prp(class_tree, faclen = 0, varlen = 0, cex = 0.75, yesno = 2)

Using the decision tree I am able to classify based on engine cylinders and city miles per gallon, whether a vehicle is a compact car, subcompact car, or midsize car class. According to the model, Compact Cars consists of 4 cylinder engines and at least 30 mpg in the city. Vehicles with engines larger than 5 cylinders and get between 20 to 22 city mpg, are classified as Midsize cars. Vehicles with engines of greater than 7 cylinders and a city MPG of more than 18 are classified as Subcompact cars. Looking into the prediction test by utilizing the Confusion Matrix, I am able to see that this decision tree is highly effective in determining if vehicles fall into the “Other” category (not compact, subcompact, or midsize) in vehicle class.

prediction_test <- predict(class_tree, newdata = training_data, type = "class")
prediction_test1 <- predict(class_tree, newdata = validation_data, type = "class")

#View(as.data.frame(prediction_test))

confusionMatrix(prediction_test, training_data$Class) #to see how right your prediction test is
## Confusion Matrix and Statistics
## 
##                  Reference
## Prediction        Compact Cars Midsize Cars Other Subcompact Cars
##   Compact Cars               2            0     0               0
##   Midsize Cars               0            1     0               0
##   Other                     45           36  1460              49
##   Subcompact Cars            0            0     1               2
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9179          
##                  95% CI : (0.9034, 0.9309)
##     No Information Rate : 0.9154          
##     P-Value [Acc > NIR] : 0.3807          
##                                           
##                   Kappa : 0.0664          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Compact Cars Class: Midsize Cars Class: Other
## Sensitivity                     0.042553           0.0270270      0.99932
## Specificity                     1.000000           1.0000000      0.03704
## Pos Pred Value                  1.000000           1.0000000      0.91824
## Neg Pred Value                  0.971769           0.9774295      0.83333
## Prevalence                      0.029449           0.0231830      0.91541
## Detection Rate                  0.001253           0.0006266      0.91479
## Detection Prevalence            0.001253           0.0006266      0.99624
## Balanced Accuracy               0.521277           0.5135135      0.51818
##                      Class: Subcompact Cars
## Sensitivity                        0.039216
## Specificity                        0.999353
## Pos Pred Value                     0.666667
## Neg Pred Value                     0.969240
## Prevalence                         0.031955
## Detection Rate                     0.001253
## Detection Prevalence               0.001880
## Balanced Accuracy                  0.519284
confusionMatrix(prediction_test1, validation_data$Class)
## Confusion Matrix and Statistics
## 
##                  Reference
## Prediction        Compact Cars Midsize Cars Other Subcompact Cars
##   Compact Cars               0            0     1               0
##   Midsize Cars               0            0     1               0
##   Other                     22           11   624              25
##   Subcompact Cars            0            0     0               1
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9124          
##                  95% CI : (0.8887, 0.9325)
##     No Information Rate : 0.9139          
##     P-Value [Acc > NIR] : 0.5879          
##                                           
##                   Kappa : 0.0269          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Compact Cars Class: Midsize Cars Class: Other
## Sensitivity                      0.00000             0.00000      0.99681
## Specificity                      0.99849             0.99852      0.01695
## Pos Pred Value                   0.00000             0.00000      0.91496
## Neg Pred Value                   0.96784             0.98392      0.33333
## Prevalence                       0.03212             0.01606      0.91387
## Detection Rate                   0.00000             0.00000      0.91095
## Detection Prevalence             0.00146             0.00146      0.99562
## Balanced Accuracy                0.49925             0.49926      0.50688
##                      Class: Subcompact Cars
## Sensitivity                         0.03846
## Specificity                         1.00000
## Pos Pred Value                      1.00000
## Neg Pred Value                      0.96345
## Prevalence                          0.03796
## Detection Rate                      0.00146
## Detection Prevalence                0.00146
## Balanced Accuracy                   0.51923